Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# Hi IODS, I am eager to learn about open data, open science and data science. I expect to improve my skills in data analysis and interpretation, and I hope to learn new things along the way. This course was highly recommended by a colleague who attended it in the past, so I am excited and ready to learn.
date()
## [1] "Mon Nov 28 18:26:34 2022"
# I am familiar with R, but I found some information and suggestions in the book that might be useful. I usually save the R files and folders for a particular project in a designated project folder, but I have never used an RStudio project. It might be useful to try one and see how it differs from my approach. I am also familiar with most of the contents of chapters 1-4. However, it was fun to run the code and realize ways to improve my own code and plots.
date()
## [1] "Mon Nov 28 18:26:34 2022"
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Nov 28 18:26:34 2022"
#This week I studied chapter 7 "Linear regression" from the book R for Health Data Science, available here: https://argoshare.is.ed.ac.uk/healthyr_book/chap07-h1.html
#I am also learning data wrangling, i.e. how to manage large data tables and create small, manageable working tables. After that I am learning regression analysis and model validation with univariate and multivariate regression models.
Data wrangling (max 5 points)
#The code for this exercise is available inside the data folder along with the file created:
#create_learning2014.R
#assignment2_regression_and_model_validation_data_wrangling.csv
Analysis (max 15 points)
Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/learning2014.txt . (The separator is a comma “,” and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)
#This dataset comes from a survey conducted during the study "The relationship between learning approaches and students' achievements in an introductory statistics course in Finland, 2015". The results of the study are available here: https://www.slideshare.net/kimmovehkalahti/the-relationship-between-learning-approaches-and-students-achievements-in-an-introductory-statistics-course-in-finland
learning2014 <- read.csv("data/assignment2_regression_and_model_validation_data_wrangling.csv")
dim(learning2014)
## [1] 166 7
#166 rows and 7 columns
#I am using a shorter version of the dataset produced by data wrangling. Here is the structure of the dataset:
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
#Each column is described below:
# gender is a categorical variable with levels Male (M) and Female (F)
# Age is age in years, derived from the date of birth
# Attitude is a numerical variable representing global attitude toward statistics
# deep is a numerical variable representing the deep approach, calculated as the mean of several survey items about seeking meaning, relating ideas and the use of evidence
# stra is a numerical variable representing the strategic approach, calculated as the mean of several survey items about organized studying and time management
# surf is a numerical variable representing the surface approach, calculated as the mean of several survey items about lack of purpose, unrelated memorising and syllabus-boundness
# Points is an integer variable representing exam points
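As a sketch of how the deep, stra and surf scores are formed (the item names and values below are hypothetical; the actual wrangling script selects the real survey items by name):

```r
# Two hypothetical respondents, three hypothetical deep-approach items.
items <- data.frame(D03 = c(4, 3), D11 = c(3, 5), D19 = c(4, 4))
deep <- unname(rowMeans(items))  # combination score = mean over the items
deep  # 3.67 and 4 on the original 1-5 scale
```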
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
#Here is a summary table of the above dataset:
summary(learning2014)
## gender Age Attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# In this study female participants were almost double the number of male participants. Based on the summary, the deep, surface and strategic scores and the exam points are roughly normally distributed, as their mean and median values are very close. The age distribution is skewed to the right: most participants are young (interquartile range 21-27 years), with a few older students (maximum 55 years).
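A quick sanity check of the skew direction (a sketch on simulated ages, not the course data): in a right-skewed sample the mean exceeds the median, just as mean Age (25.51) exceeds median Age (22) above.

```r
set.seed(1)
# Simulated right-skewed ages: a young-heavy bulk with a long older tail.
ages <- round(rexp(166, rate = 1/8)) + 17
mean(ages) > median(ages)  # TRUE, the same mean-median pattern as in summary()
```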
#Here is a graphical overview from the dataset for better understanding of the relationship of available variables:
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# Draw the plot
p <- ggpairs(learning2014, mapping = aes(col = gender),
lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
p
# Male students are more likely to report a high attitude towards statistics.
# There was almost no correlation between age and the other variables.
# Among male students, the surface approach shows a significant negative correlation with both attitude and the deep approach.
# There was a significant positive correlation between attitude and points scored for both genders.
# A negative correlation is also observed for both genders between points scored and the deep and surface approaches, although these correlations were not significant.
Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent, outcome) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)
# fit a linear model
my_model <- lm(Points ~ Attitude + stra + surf, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## Attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
#In this multiple regression model the only explanatory variable with a clearly significant effect on the student's final exam result is *Attitude* (p < 0.001)! There is no significant relationship between the strategic or surface approach and points scored. The R-squared value of 0.2074 implies that the model explains about 20%, or one-fifth, of the variation in the outcome.
# fit a linear model with the only variable that has a statistically significant relationship with Points, i.e. Attitude
my_model <- lm(Points ~ Attitude, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## Attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R-squared of the model. (0-3 points)
# The target variable 'Points' has a statistically significant relationship with the explanatory variable 'Attitude' (p < 0.001). At a (theoretical) attitude value of 0, the predicted score would be 11.64 points, with a standard error of 1.83. For every one-point increase in attitude, the predicted exam score increases by 3.53 points, with a standard error of 0.57.
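As an illustration, plugging the coefficient estimates reported above into the fitted line gives a prediction for any attitude value:

```r
# Coefficient estimates copied from summary(my_model) above.
intercept <- 11.6372
slope <- 3.5255
predict_points <- function(attitude) intercept + slope * attitude
predict_points(3)  # about 22.2 points at attitude 3, close to the mean of Points
```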
#R-squared is always between 0 and 1: 0 (0%) indicates that the model explains none of the variability of the response around its mean, while 1 (100%) indicates that it explains all of it. Multiple R-squared is used for evaluating how well the model fits the data. In this case, the multiple R-squared value of 0.1906 implies that the model explains only 19.06% of the variation in the outcome.
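A small sketch (on R's built-in mtcars data, not the course data) of why multiple R-squared measures fit: for a linear model with an intercept it equals the squared correlation between the observed and fitted values.

```r
fit <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(fit)$r.squared
all.equal(r2, cor(mtcars$mpg, fitted(fit))^2)  # TRUE
```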
#This is how the model looks
library(ggplot2)
qplot(Attitude, Points, data = learning2014) + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
## `geom_smooth()` using formula 'y ~ x'
Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)
#R makes it easy to graphically explore the validity of your model assumptions. If you give a linear model object as the first argument to the `plot()` function, the function automatically assumes you want diagnostic plots and will produce them. You can check the help page of plotting an lm object by typing `?plot.lm` or `help(plot.lm)` to the R console.
#In the plot function you can then use the argument `which` to choose which plots you want. `which` must be an integer vector corresponding to the following list of plots:
#which | graphic
#----- | --------
#1 | Residuals vs Fitted values
#2 | Normal QQ-plot
#3 | Standardized residuals vs Fitted values
#4 | Cook's distances
#5 | Residuals vs Leverage
#6 | Cook's distance vs Leverage
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#Residuals vs Fitted values
plot(my_model, which = 1)
#The Residuals vs Fitted plot examines the variance of the residuals. A residual is the difference between an observed value and the value fitted by the model.
#The linear regression model assumes homoscedasticity, i.e. that the residuals have equal or similar variance across the range of fitted values. In this model the residuals stay within roughly 10 points (positive and negative) of the fitted values, with no obvious pattern, which supports the assumption. There are three exceptional observations: 145, 56 and 35.
#Normal QQ-plot
plot(my_model, which = 2)
#A Q-Q (quantile-quantile) plot is a probability plot used for comparing two distributions by plotting their quantiles against each other. Here, the standardized residuals of the model are compared against a theoretical normal distribution. The residuals of our model are approximately normally distributed. There are some deviating observations at the extremes of the distribution, which are expected to have very little impact on the model. Observations 145, 56 and 35 again stand out as poorly fitted by the model.
#Residuals vs Leverage
plot(my_model, which = 5)
#Here we plotted the standardized residuals against the leverage of each observation. Leverage measures how far the explanatory variable values of an observation are from those of the other observations, so this plot shows the potential impact of every single observation on the fitted model. A few observations have high influence, suggesting that the results are partly driven by individual data points. Observations 56 and 35, and a new observation 71, are influential, but observation 145 does not have very high leverage.
After completing all the phases above you are ready to submit your Assignment for the review (using the Moodle Workshop below). Have the two links (your GitHub repository and your course diary) ready!
date()
## [1] "Mon Nov 28 18:26:44 2022"
#Read the joined student alcohol consumption data into R either from your local folder (if you completed the Data wrangling part) or from this url (in case you got stuck with the Data wrangling part):
#https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv
#(In the above linked file, the column separator is a comma and the first row includes the column names). Print out the names of the variables in the data and describe the data set briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-1 point)
library(readr)
#I created the file in the data wrangling part, but I am reading the data from both the local file and the online source, just to check that they match
setwd("/Users/admin_adhisadi/workspace/School_work/IODS-project")
alc <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/alc.csv", show_col_types=FALSE)
alc2 <- read.csv("data/math_por.csv", sep = ",", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
colnames(alc2)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(alc)
## spec_tbl_df [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities: chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
str(alc2)
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
# Analysis done using one of the datasets
On week 3, we are working with a new dataset about student performance. The data concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. More information is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance. The ‘alc_use’ column is the average of weekday and weekend alcohol consumption (Dalc and Walc). The ‘high_use’ column was created from ‘alc_use’: it is TRUE for students whose ‘alc_use’ is greater than 2 (and FALSE otherwise).
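The derived columns described above can be sketched as follows (toy values, not the real data):

```r
# Toy Dalc/Walc values on the 1-5 scale used in the data.
Dalc <- c(1, 2, 5)
Walc <- c(1, 3, 5)
alc_use <- (Dalc + Walc) / 2   # average of weekday and weekend consumption
high_use <- alc_use > 2        # TRUE when average use exceeds 2
high_use  # FALSE TRUE TRUE
```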
Variable names included for the analysis and structure of dataset:
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(alc)
## spec_tbl_df [370 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ school : chr [1:370] "GP" "GP" "GP" "GP" ...
## $ sex : chr [1:370] "F" "F" "F" "F" ...
## $ age : num [1:370] 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr [1:370] "U" "U" "U" "U" ...
## $ famsize : chr [1:370] "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr [1:370] "A" "T" "T" "T" ...
## $ Medu : num [1:370] 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num [1:370] 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr [1:370] "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr [1:370] "teacher" "other" "other" "services" ...
## $ reason : chr [1:370] "course" "course" "other" "home" ...
## $ guardian : chr [1:370] "mother" "father" "mother" "mother" ...
## $ traveltime: num [1:370] 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num [1:370] 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr [1:370] "yes" "no" "yes" "no" ...
## $ famsup : chr [1:370] "no" "yes" "no" "yes" ...
## $ activities: chr [1:370] "no" "no" "no" "yes" ...
## $ nursery : chr [1:370] "yes" "no" "yes" "yes" ...
## $ higher : chr [1:370] "yes" "yes" "yes" "yes" ...
## $ internet : chr [1:370] "no" "yes" "yes" "yes" ...
## $ romantic : chr [1:370] "no" "no" "no" "yes" ...
## $ famrel : num [1:370] 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num [1:370] 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num [1:370] 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num [1:370] 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num [1:370] 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num [1:370] 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num [1:370] 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr [1:370] "no" "no" "yes" "yes" ...
## $ absences : num [1:370] 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num [1:370] 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num [1:370] 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num [1:370] 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num [1:370] 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi [1:370] FALSE FALSE TRUE FALSE FALSE FALSE ...
## - attr(*, "spec")=
## .. cols(
## .. school = col_character(),
## .. sex = col_character(),
## .. age = col_double(),
## .. address = col_character(),
## .. famsize = col_character(),
## .. Pstatus = col_character(),
## .. Medu = col_double(),
## .. Fedu = col_double(),
## .. Mjob = col_character(),
## .. Fjob = col_character(),
## .. reason = col_character(),
## .. guardian = col_character(),
## .. traveltime = col_double(),
## .. studytime = col_double(),
## .. schoolsup = col_character(),
## .. famsup = col_character(),
## .. activities = col_character(),
## .. nursery = col_character(),
## .. higher = col_character(),
## .. internet = col_character(),
## .. romantic = col_character(),
## .. famrel = col_double(),
## .. freetime = col_double(),
## .. goout = col_double(),
## .. Dalc = col_double(),
## .. Walc = col_double(),
## .. health = col_double(),
## .. failures = col_double(),
## .. paid = col_character(),
## .. absences = col_double(),
## .. G1 = col_double(),
## .. G2 = col_double(),
## .. G3 = col_double(),
## .. alc_use = col_double(),
## .. high_use = col_logical()
## .. )
## - attr(*, "problems")=<externalptr>
### Hypothesis
#The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption. (0-1 point)
With this dataset we are essentially studying the relationship between students’ behavior and alcohol consumption. I chose these 4 variables: famsize, Pstatus, activities and absences. Below are descriptions of the columns and my hypotheses. My hypotheses mostly come from stereotypes.
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3). A big family could mean an extended family with uncles, aunts and grandparents, or more children in the family; it probably depends on the culture. I assume here that a big family means more children. Hence my hypothesis is that alcohol consumption increases with family size, perhaps because it is harder for the parents to meet the financial and emotional needs of each child.
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart). My hypothesis is that students raised by a single parent have increased alcohol consumption, for similar reasons as above: the difficulty of meeting the financial and emotional needs of the child.
activities - extra-curricular activities (binary: yes or no). Increased extra-curricular activities mean an active, busy life and more social support. Hence my hypothesis is that alcohol consumption is less likely.
absences - number of school absences (numeric: from 0 to 93). A high number of school absences may indicate several problems in a student’s life, so my hypothesis is that the likelihood of alcohol consumption increases with absences. In addition, all of the above factors (increased family size, parents living apart and few extra-curricular activities) are likely to contribute to higher absences along with increased alcohol consumption.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gmodels)
library(ggplot2)
CrossTable(alc$high_use, alc$famsize, dnn = c("High alcohol consumption", "Family size"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | Family size
## High alcohol consumption | GT3 | LE3 | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 192 | 67 | 259 |
## | 0.181 | 0.462 | |
## | 0.741 | 0.259 | 0.700 |
## | 0.722 | 0.644 | |
## | 0.519 | 0.181 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 74 | 37 | 111 |
## | 0.422 | 1.078 | |
## | 0.667 | 0.333 | 0.300 |
## | 0.278 | 0.356 | |
## | 0.200 | 0.100 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 266 | 104 | 370 |
## | 0.719 | 0.281 | |
## -------------------------|-----------|-----------|-----------|
##
##
p1 <- alc %>%
ggplot(aes(x = famsize, fill = high_use)) +
geom_bar()
p2 <- alc %>%
ggplot(aes(x = famsize, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")
library(patchwork)
p1 + p2
Large family size does not appear to be associated with high alcohol consumption.
CrossTable(alc$high_use, alc$Pstatus, dnn = c("High alcohol consumption", "Parental status"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | Parental status
## High alcohol consumption | A | T | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 26 | 233 | 259 |
## | 0.014 | 0.002 | |
## | 0.100 | 0.900 | 0.700 |
## | 0.684 | 0.702 | |
## | 0.070 | 0.630 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 12 | 99 | 111 |
## | 0.032 | 0.004 | |
## | 0.108 | 0.892 | 0.300 |
## | 0.316 | 0.298 | |
## | 0.032 | 0.268 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 38 | 332 | 370 |
## | 0.103 | 0.897 | |
## -------------------------|-----------|-----------|-----------|
##
##
p1 <- alc %>%
ggplot(aes(x = Pstatus, fill = high_use)) +
geom_bar()
p2 <- alc %>%
ggplot(aes(x = Pstatus, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")
p1 + p2
Parent’s cohabitation status does not seem to contribute to alcohol consumption.
CrossTable(alc$high_use, alc$activities, dnn = c("High alcohol consumption", "Extra-curricular activities"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 370
##
##
## | Extra-curricular activities
## High alcohol consumption | no | yes | Row Total |
## -------------------------|-----------|-----------|-----------|
## FALSE | 120 | 139 | 259 |
## | 0.224 | 0.210 | |
## | 0.463 | 0.537 | 0.700 |
## | 0.670 | 0.728 | |
## | 0.324 | 0.376 | |
## -------------------------|-----------|-----------|-----------|
## TRUE | 59 | 52 | 111 |
## | 0.523 | 0.490 | |
## | 0.532 | 0.468 | 0.300 |
## | 0.330 | 0.272 | |
## | 0.159 | 0.141 | |
## -------------------------|-----------|-----------|-----------|
## Column Total | 179 | 191 | 370 |
## | 0.484 | 0.516 | |
## -------------------------|-----------|-----------|-----------|
##
##
p1 <- alc %>%
ggplot(aes(x = activities, fill = high_use)) +
geom_bar()
p2 <- alc %>%
ggplot(aes(x = activities, fill = high_use)) +
geom_bar(position = "fill") +
ylab("proportion")
p1 + p2
Extra-curricular activities do not seem to affect alcohol consumption.
library(ggplot2)
#let's plot absences as it is numeric
p1 <- ggplot(alc, aes(x= absences, fill=high_use)) + geom_density(alpha=0.4)
p1+ theme_bw()
### Logistic regression
#model
m <- glm(high_use ~ (alc$famsize == "GT3") + (activities == "no") + (Pstatus == "A") + absences, data = alc, family = "binomial")
#odds ratio
OR <- coef(m) %>% exp
#confidence interval
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
#summary
summary(m)
##
## Call:
## glm(formula = high_use ~ (alc$famsize == "GT3") + (activities ==
## "no") + (Pstatus == "A") + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2960 -0.8113 -0.7217 1.2363 1.8858
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1310 0.2708 -4.177 2.96e-05 ***
## alc$famsize == "GT3"TRUE -0.3660 0.2568 -1.425 0.15404
## activities == "no"TRUE 0.2847 0.2350 1.211 0.22575
## Pstatus == "A"TRUE -0.2759 0.4029 -0.685 0.49336
## absences 0.0900 0.0234 3.846 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 430.89 on 365 degrees of freedom
## AIC: 440.89
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) alc$famsize == "GT3"TRUE activities == "no"TRUE
## -1.13102023 -0.36599660 0.28471938
## Pstatus == "A"TRUE absences
## -0.27594749 0.08999505
#odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.3227039 0.1869931 0.5422024
## alc$famsize == "GT3"TRUE 0.6935052 0.4203051 1.1526892
## activities == "no"TRUE 1.3293889 0.8391496 2.1118619
## Pstatus == "A"TRUE 0.7588528 0.3313845 1.6273877
## absences 1.0941689 1.0473341 1.1480407
The results suggest that larger family size, parents living apart, and lack of extra-curricular activities do not contribute to increased alcohol consumption: their confidence intervals all span 1 (ranging roughly from 0.33 to 2.11), so none of these effects is statistically significant. For absences, OR = 1.09 (95% CI 1.05–1.15) means that each additional absence multiplies the odds of high alcohol use by about 1.09. My hypothesis was not supported.
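An odds ratio is simply the exponential of a model coefficient, and a quick Wald-style interval can be formed from the standard error. A minimal sketch using the absences estimate reported above (the profiled intervals from confint() differ slightly from this approximation):

```r
# odds ratio and approximate 95% CI for absences
# (estimate 0.0900, SE 0.0234, taken from the model summary above)
b <- 0.0900
se <- 0.0234
exp(b)                        # odds ratio, about 1.094
exp(b + c(-1.96, 1.96) * se)  # Wald 95% CI, about (1.05, 1.15)
```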
# predictions versus actual values
# predict() the probabilty of high_use
probability <- predict(m, type = "response")
# add the predicted probability to 'alc'
alc <- mutate(alc, probability = probability)
# use probability to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5 )
# tabulate target variable versus predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>%
prop.table() %>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67837838 0.02162162 0.70000000
## TRUE 0.26756757 0.03243243 0.30000000
## Sum 0.94594595 0.05405405 1.00000000
This shows that my hypothesis was poor; the common negative stereotypes behind it should be avoided. The only variable in my model that works is absences, which is shown below:
# new model
m2 <- glm(high_use ~ absences, data = alc, family = "binomial")
# Odds ratios
OR2 <- coef(m2) %>% exp
# confidence intervals
CI2 <- confint(m2) %>% exp
## Waiting for profiling to be done...
#probability of high alcohol consumption
probability_2 <- predict(m2, type = "response")
# add predicted probability to 'alc'
alc <- mutate(alc, probability_2 = probability_2)
# use probability to make a prediction of high_use
alc <- mutate(alc, prediction_2 = probability_2 > 0.5 )
# Summary of the model
summary(m2)
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3598 -0.8209 -0.7318 1.2658 1.7419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.26945 0.16094 -7.888 3.08e-15 ***
## absences 0.08867 0.02317 3.827 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 434.52 on 368 degrees of freedom
## AIC: 438.52
##
## Number of Fisher Scoring iterations: 4
# Coefficients of the model
coef(m2)
## (Intercept) absences
## -1.26944532 0.08866504
# Odds ratios with their confidence intervals
cbind(OR2, CI2)
## OR2 2.5 % 97.5 %
## (Intercept) 0.2809874 0.2030806 0.3819961
## absences 1.0927146 1.0464586 1.1459894
The results show that students who are more absent from their classes are more likely to consume alcohol in excess. Let’s see how these two models compare with a likelihood ratio test:
anova(m, m2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: high_use ~ (alc$famsize == "GT3") + (activities == "no") + (Pstatus ==
## "A") + absences
## Model 2: high_use ~ absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 365 430.89
## 2 368 434.52 -3 -3.6311 0.3042
The deviance difference (3.63 on 3 degrees of freedom, p = 0.30) is not significant, so the three extra predictors do not improve the fit and the simpler model is preferable.
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability_2, y = high_use, col= prediction_2)) + xlab("probability")
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction_2) %>%
prop.table() %>%
addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68648649 0.01351351 0.70000000
## TRUE 0.27567568 0.02432432 0.30000000
## Sum 0.96216216 0.03783784 1.00000000
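The share of wrong predictions in these tables can also be computed with a generic loss function. A minimal, self-contained sketch (loss_func and the toy inputs are illustrative, not part of the data above):

```r
# proportion of misclassified observations, given the true class
# (TRUE/FALSE) and the predicted probability of the positive class
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# toy check: only the second prediction is on the wrong side of 0.5
loss_func(class = c(TRUE, TRUE, FALSE, FALSE),
          prob  = c(0.9, 0.4, 0.2, 0.1))  # 0.25
```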
date()
## [1] "Mon Nov 28 18:26:52 2022"
Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’. Then include the file as a child file in your ‘index.Rmd’ file. Perform the following analysis in the file you created. Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. Details about the Boston dataset can be seen for example here. (0-1 points) https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
library(dplyr)
library(ggplot2)
library(GGally)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset describes housing values in suburbs of Boston. It has 506 rows and 14 columns. Here is a description of each column:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
cor_matrix <- cor(Boston)
cor_matrix <- cor_matrix %>% round(digits = 2)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")
library("psych") #for p-value table
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
cor_test_mat <- corr.test(cor_matrix)$p # p-value table; note this tests the columns of the correlation matrix, whereas corr.test(Boston)$p would test the raw variables directly
cor_test_mat
## crim zn indus chas nox
## crim 0.000000e+00 5.460004e-02 5.861585e-03 1.0000000 7.907814e-03
## zn 2.284457e-03 0.000000e+00 1.271309e-04 1.0000000 2.417195e-04
## indus 1.105960e-04 1.672775e-06 0.000000e+00 1.0000000 9.560598e-08
## chas 4.167137e-01 9.742994e-01 7.988462e-01 0.0000000 1.000000e+00
## nox 1.550552e-04 3.357216e-06 1.074225e-09 0.9200326 0.000000e+00
## rm 2.431550e-03 2.434360e-03 3.886776e-04 0.4966148 1.649605e-03
## age 6.401623e-04 3.007306e-07 1.054161e-07 0.9626125 5.329690e-09
## dis 5.756140e-04 2.863417e-07 5.617665e-08 0.9200861 2.056115e-09
## rad 1.066445e-06 2.865584e-04 1.712863e-06 0.5139848 4.791323e-06
## tax 2.762339e-06 1.722141e-04 1.424366e-07 0.5004301 1.058846e-06
## ptratio 1.738565e-03 8.983828e-04 7.449341e-04 0.2954829 4.565702e-03
## black 6.091967e-05 8.087550e-03 3.260073e-04 0.6344478 3.329302e-04
## lstat 5.858634e-05 7.173685e-05 8.898971e-07 0.5508139 4.162472e-06
## medv 1.706341e-04 8.170349e-04 4.791024e-05 0.3461764 2.671186e-04
## rm age dis rad tax
## crim 5.460004e-02 2.304584e-02 2.129772e-02 8.259000e-05 2.016508e-04
## zn 5.460004e-02 2.405844e-05 2.319368e-05 1.318169e-02 8.531706e-03
## indus 1.554711e-02 8.854952e-06 4.775015e-06 1.284648e-04 1.167980e-05
## chas 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## nox 4.948816e-02 4.636831e-07 1.809382e-07 3.353926e-04 8.259000e-05
## rm 0.000000e+00 5.460004e-02 1.170387e-01 5.132770e-02 2.777919e-02
## age 2.275002e-03 0.000000e+00 8.982624e-08 4.066218e-03 1.841719e-03
## dis 6.665341e-03 9.980694e-10 0.000000e+00 2.689767e-03 1.433697e-03
## rad 1.974142e-03 7.530033e-05 4.482945e-05 0.000000e+00 6.702657e-11
## tax 8.187115e-04 2.877686e-05 2.166334e-05 7.365557e-13 0.000000e+00
## ptratio 2.925326e-04 4.236675e-03 6.502152e-03 3.354635e-04 2.702118e-04
## black 1.717361e-02 1.503185e-03 1.809994e-03 2.139846e-05 2.902442e-05
## lstat 2.721164e-06 7.283252e-06 6.071380e-05 2.338660e-05 5.374022e-06
## medv 1.140787e-07 4.559715e-04 2.010551e-03 1.193301e-04 3.677290e-05
## ptratio black lstat medv
## crim 4.948816e-02 0.0034606865 3.398008e-03 8.531706e-03
## zn 2.874825e-02 0.1264815268 3.945527e-03 2.777919e-02
## indus 2.607269e-02 0.0142572745 7.030187e-05 2.826704e-03
## chas 1.000000e+00 1.0000000000 1.000000e+00 1.000000e+00
## nox 8.674833e-02 0.0142572745 2.955355e-04 1.282169e-02
## rm 1.318169e-02 0.2404305720 2.013662e-04 9.468528e-06
## age 8.473349e-02 0.0465987475 4.952611e-04 1.778289e-02
## dis 1.170387e-01 0.0494881609 3.460687e-03 5.132770e-02
## rad 1.425727e-02 0.0014336967 1.520129e-03 6.205164e-03
## tax 1.282169e-02 0.0018417188 3.708075e-04 2.243147e-03
## ptratio 0.000000e+00 0.1264815268 1.425727e-02 1.841719e-03
## black 7.905095e-03 0.0000000000 1.812792e-02 4.948816e-02
## lstat 3.240290e-04 0.0004770506 0.000000e+00 1.908560e-06
## medv 2.901205e-05 0.0016797274 2.219256e-08 0.000000e+00
corrplot(cor_matrix,p.mat = cor_test_mat,insig = "p-value")
This plot shows the correlation matrix. The area of each circle reflects the absolute value of the corresponding correlation coefficient. Lighter shades mean weaker correlation (white is no correlation); darker blue means stronger positive correlation and darker red means stronger negative correlation. The numbers inside the cells are p-values, printed only in cells where the correlation coefficient is not significant. We see both positive and negative correlations with significant p-values among several variables: dis correlates strongly negatively with indus, nox, and age, as does lstat with medv, while the strongest positive correlation is between rad and tax.
# pairs plot of the raw Boston variables (not of the correlation matrix)
p <- ggpairs(Boston, mapping = aes(),
lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
p
The distributions vary: rad and tax appear bimodal, crim, zn, dis, and lstat are right-skewed, and age and black are left-skewed. Here is the summary of the variables:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim) # ensure crim is a plain numeric vector
# summary of the scaled crime rate
summary(Boston$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.08204 0.25651 3.61352 3.67708 88.97620
# create a categorical variable of the crime rate in the Boston dataset, Use the quantiles as the break points
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high
##      127      126      126      127
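Using quantile() break points, cut() splits a vector into four roughly equal groups. A toy sketch (the vector and labels here are illustrative, mirroring the ones used for crime):

```r
x <- 1:8
q <- quantile(x)  # min, 25%, 50%, 75%, max
groups <- cut(x, breaks = q, include.lowest = TRUE,
              labels = c("low", "med_low", "med_high", "high"))
table(groups)  # two observations per group
```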
#Drop the old crime rate variable from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
#Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
By scaling, we have subtracted each column's mean from that column and divided the difference by the column's standard deviation. \[scaled(x) = \frac{x - mean(x)}{ sd(x)}\] As a result, in the scaled output every variable now has mean 0 and unit standard deviation; values below the original mean become negative and values above it positive.
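The formula above can be checked directly against scale() on a toy vector:

```r
# scale() is equivalent to (x - mean(x)) / sd(x), column by column
x <- c(1, 2, 3, 4, 10)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```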
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
Linear discriminant analysis:
# linear discriminant analysis on train set
#I am reloading boston_scaled from the course repository to be safe; note that if cut() is called without labels, the factor() call above cannot match its levels to the interval labels, which turns crime into NAs and makes the lda function warn.
boston_scaled <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/boston_scaled.txt",sep=",", header = T)
boston_scaled$crime <- factor(boston_scaled$crime, levels = c("low", "med_low", "med_high", "high"))
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2425743 0.2376238 0.2574257 0.2623762
##
## Group means:
## zn indus chas nox rm age
## low 0.8754215 -0.8990201 -0.11163110 -0.8477638 0.4452425 -0.8133654
## med_low -0.1029092 -0.2476803 0.01475116 -0.5274802 -0.1272015 -0.3197622
## med_high -0.3957140 0.1795083 0.25766519 0.3871551 0.1446929 0.4280506
## high -0.4872402 1.0170298 -0.04947434 1.0563615 -0.4161429 0.8051736
## dis rad tax ptratio black lstat
## low 0.8141096 -0.7029578 -0.7594912 -0.4088442 0.37253195 -0.75856240
## med_low 0.3125782 -0.5476071 -0.4628535 -0.0530769 0.32061553 -0.16367208
## med_high -0.4077221 -0.3833432 -0.2956245 -0.3560913 0.07176142 0.02128996
## high -0.8558070 1.6390172 1.5146914 0.7818116 -0.82209334 0.88956081
## medv
## low 0.498698437
## med_low 0.005154015
## med_high 0.224556619
## high -0.694206900
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.121975970 0.75581934 -0.94434567
## indus -0.009527359 -0.25110608 0.44253995
## chas -0.069539452 -0.09834948 0.07131595
## nox 0.411436184 -0.62113629 -1.31441586
## rm -0.092163784 -0.06633313 -0.17045902
## age 0.309486290 -0.24683938 -0.22706237
## dis -0.037874986 -0.20827343 0.19586038
## rad 2.969835546 0.90756338 -0.18520370
## tax 0.044962236 -0.12328153 0.63808147
## ptratio 0.134738484 0.14163378 -0.25827683
## black -0.138819258 0.01357610 0.12863179
## lstat 0.216659774 -0.25140130 0.25273527
## medv 0.203499057 -0.38107769 -0.25307255
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9505 0.0372 0.0123
The proportion of trace shows that LD1 alone accounts for about 95% of the between-group variance, so the first discriminant does almost all of the separating.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric. Leaving this as character gives error.
classes <- as.numeric(train$crime)
# Draw the LDA (bi)plot
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)
test <- boston_scaled[-ind,]
# save the crime categories from the test set
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes , predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 9 0 0
## med_low 6 22 2 0
## med_high 1 7 14 0
## high 0 0 0 21
plot(table(correct = correct_classes , predicted = lda.pred$class))
#proportion of wrong predictions in the test data
mean(correct_classes != lda.pred$class)
## [1] 0.245098
Here is the test error rate:
error <- mean(correct_classes != lda.pred$class)
errorrate <- round(error * 100, 1)
paste0(errorrate, "%")
## [1] "24.5%"
Reload the Boston dataset and standardize the dataset (we did not do this in the Exercise Set, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
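By default dist() computes Euclidean distances, which is why scaling matters: unscaled variables with large ranges would dominate. A one-line sanity check (illustrative, not part of the analysis):

```r
# two points forming a 3-4-5 right triangle
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m))  # 5
```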
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line') + scale_x_continuous(breaks=seq(1, 10, 1))
The total WCSS drops sharply up to two clusters and levels off after that, so two clusters would seem optimal.
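The elbow logic can be illustrated on synthetic data: total WCSS always decreases as k grows, and the drop flattens once the true number of clusters is reached. A sketch with two well-separated toy groups (the data and nstart value are my additions):

```r
set.seed(1)
# two well-separated groups of 10 points in 2 dimensions
x <- rbind(matrix(rnorm(20), ncol = 2),
           matrix(rnorm(20, mean = 8), ncol = 2))
twcss <- sapply(1:4, function(k) kmeans(x, k, nstart = 5)$tot.withinss)
round(twcss, 1)  # sharp drop from k = 1 to k = 2, then it levels off
```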
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
This is hard to interpret. Let's look at the clusters in two other ways: 1. by splitting the dataset into subsets of columns, and 2. by plotting individual pairs of variables separately (only two pairs that seem correlated for now).
pairs(boston_scaled[, 1:7], col = km$cluster)
pairs(boston_scaled[, 8:14], col = km$cluster)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ psych::%+%() masks ggplot2::%+%()
## ✖ psych::alpha() masks ggplot2::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ MASS::select() masks dplyr::select()
clusters <- factor(km$cluster)
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled %>% ggplot(aes(x = indus, y = nox, col = clusters)) +
geom_point()
boston_scaled <- as.data.frame(boston_scaled)
boston_scaled %>% ggplot(aes(x =lstat, y = medv, col = clusters)) +
geom_point()
#ggpairs(boston_scaled, mapping = aes(col = clusters), lower = list(combo = wrap("facethist", bins = 20))) + theme_bw()
#commented because it is still crowded figure.
We see a positive correlation between indus and nox in cluster 1 but none in cluster 2. There is a negative correlation between lstat and medv in both clusters.
(more chapters to be added similarly as we proceed with the course!)